Abstract

It is well known that speaker verification systems are subject to
spoofing attacks. The Automatic Speaker Verification Spoofing
and Countermeasures Challenge (ASVSpoof2015) provides
a standard spoofing database, containing attacks based on
synthetic speech, along with a protocol for experiments. This
paper describes CPqD's systems submitted to the ASVSpoof2015
Challenge, based on deep neural networks, working both as a
classifier and as a feature extraction module for a GMM and
an SVM classifier. Results show the validity of this approach,
achieving less than 0.5% EER for known attacks.

Index Terms: Speaker Verification, Spoofing Countermeasures,
Deep Neural Networks

1. Introduction 

Biometric spoofing is usually described as a direct attack
perpetrated against a biometric authentication system by
presenting it with a fake (forged or copied) biometric sample.
Anti-spoofing therefore refers to countermeasures designed to
detect and prevent these attacks.

In the last few years, many studies have shown that even
state-of-the-art automatic speaker verification (ASV) systems
are vulnerable to such attacks, which can be based on a variety
of techniques, including voice conversion, speech synthesis,
artificial signals, impersonation, and replay. Although most of
these studies also propose countermeasures, these are usually
based on prior knowledge of the attack method, which is clearly
unrepresentative of real-world scenarios. Additionally, each
study relies on its own database, protocol and metrics, which
makes it difficult to analyze the results properly and prevents
a fair comparison among them.

The recent Automatic Speaker Verification Spoofing and
Countermeasures Challenge, ASVSpoof2015, which focused
on spoofing attacks based on synthetic speech, provided the
first standard spoofing database along with a protocol for
experiments. Unlike in previous work, 10 different voice
conversion and speech synthesis algorithms were used to generate
the database, but only 5 of them were known in advance and
available for training spoofing detection algorithms. This paper
describes the neural-network-based systems submitted to the
challenge and analyzes the results obtained.

Deep Neural Networks (DNNs) have been widely used in
a variety of research fields, such as image classification,
natural language processing and information retrieval.
In the speech processing community, DNNs have been applied
to speech recognition, speech synthesis and also to
speaker recognition.

http://www.spoofingchallenge.org/


One straightforward application of a DNN for spoofing
detection is to use it as a classifier, whose input data can be
either raw audio or features previously extracted from the audio
files. A natural choice for audio pre-processing is to use features
proven to yield good results in speaker recognition and spoofing
detection tasks, such as traditional Mel Frequency Cepstral
Coefficients (MFCC) [14] and Modified Group Delay Cepstral
Coefficients (MGDCC), which have been broadly used not
only in combination with neural networks, but also with a
handful of other classification algorithms.

In problems like spoofing detection, a DNN can also be
employed as a feature extraction module itself, by means of a
bottleneck approach. In this case, a network initially trained
for regression or classification has its final layers removed, and
the output of its last remaining layer is used as a new
representation of the input data for subsequent classification.
The network can receive as input a pre-processed feature vector,
a high-level full representation of the signal (using, for
instance, the Fast Fourier transform) or even the raw audio. In
this work, we used the high-level representation approach, as
described in Section 3.

The paper is organized as follows: Section 2 presents a brief
description of neural networks. Section 3 explains the methods
applied. Section 4 presents and discusses the results obtained on
the ASVspoof2015 challenge. Finally, Section 5 draws some
conclusions and points to topics for future research.

2. Neural Networks 

The submitted systems are based on a Deep Learning approach.
A deep neural network (DNN) is an artificial neural network
with more than one hidden layer of neurons between its inputs
and outputs. The DNN concept can be implemented using
many different architectures, such as Convolutional Neural
Networks (CNN), Autoencoders, and Multilayer Perceptrons
(MLP).

In a Multilayer Perceptron, typically, each neuron $j$ in a
hidden layer $l$ employs a sigmoid function, such as the logistic
function or the hyperbolic tangent, to map the total input
$x_j^l$, received from layer $l-1$, to an output $y_j^l$ that is
sent to the following layer, $l+1$:

$$x_j^l = b_j^l + \sum_{i=1}^{N_{l-1}} w_{ij}^l \, y_i^{l-1} \qquad (1)$$

$$y_j^l = \mathrm{logistic}(x_j^l) \qquad (2)$$

where $N_{l-1}$ is the number of neurons in layer $l-1$,
$y_i^{l-1}$ is the output of neuron $i$ in the previous layer,
$w_{ij}^l$ is the connection weight between neuron $i$ of layer
$l-1$ and neuron $j$ of layer $l$, and $b_j^l$ is the bias of
neuron $j$ of the current layer.

One of the major applications of DNNs is multiclass
classification. In this context, a softmax nonlinear function
can be used in the network output layer to convert the inputs
$x_j^{out}$ into class probabilities $p_j$:

$$p_j = \frac{\exp(x_j^{out})}{\sum_{1 \le k \le N_{out}} \exp(x_k^{out})} \qquad (3)$$

where $N_{out}$ is the number of neurons in the output layer,
which is equal to the number of possible classes. In this case,
the network output $p_j$ indicates the likelihood that the input
fed to the network belongs to the $j$-th class.
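As an illustration of Eqs. (1) to (3), the following minimal Python sketch computes one logistic layer and a softmax output. The function names, the weight layout (weights[i][j] connects neuron i of the previous layer to neuron j) and the toy values are assumptions made for this example, not part of the submitted systems.

```python
import math

def logistic(x):
    """Logistic sigmoid used as activation in Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(y_prev, weights, biases):
    """Eqs. (1)-(2): weights[i][j] links neuron i (layer l-1) to neuron j (layer l)."""
    outputs = []
    for j, b_j in enumerate(biases):
        # Total input x_j: bias plus weighted sum of previous-layer outputs.
        x_j = b_j + sum(weights[i][j] * y_i for i, y_i in enumerate(y_prev))
        outputs.append(logistic(x_j))
    return outputs

def softmax(x):
    """Eq. (3); inputs are shifted by max(x) for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values, for illustration only.
hidden = layer_forward([0.5, -1.0], [[0.1, 0.2], [0.3, -0.4]], [0.0, 0.1])
probs = softmax([1.0, 2.0, 3.0])
```

The softmax outputs are positive and sum to one, which is what allows them to be read as class probabilities.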


3. Method 

3.1. Feature Extraction 

In order to detect whether an audio sample is authentic, a deep
neural network based on a multilayer perceptron architecture was
used as a feature extraction module. In a bottleneck approach,
the network output layer is removed and the activations of the
last hidden layer neurons are treated as new features for
subsequent classification. Figure 1 shows how the audio was
processed, from feature extraction to supervised network training.

Instead of feeding the raw signal directly to the network,
a pre-processing step was performed to transform input signals
into sequences of feature vectors. This decision was based on
preliminary tests, which indicated that such a step improved the
learning rate and allowed the use of more compact networks.
Each signal file is therefore divided into a sequence of
consecutive, non-overlapping 20 ms frames. No window function is
applied. In parallel, a voice activity detection method based on
ITU G.729B [21] is applied, so that each frame is classified as
speech or non-speech and only speech frames are preserved.

Different representations were tested as input for the MLP,
including the raw speech frame itself, MFCC, MGDCC and
Discrete Fourier Transform (DFT) coefficients. However, the
best results were achieved with Discrete Cosine Transform
(DCT) coefficients. The DCT has the energy compaction
property, which concentrates most of the signal information in
a few low-frequency components [22]. For this reason, the first
128 DCT coefficients are used as features for each active speech
frame.
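The framing and DCT steps can be sketched as follows. This is only a sketch under stated assumptions: the 16 kHz sampling rate, the DCT-II variant without normalization, and the dropping of a trailing partial frame are not specified in the text, and voice activity detection is omitted.

```python
import math

def split_frames(signal, sample_rate, frame_ms=20):
    """Consecutive non-overlapping frames, no window function.
    A trailing partial frame is dropped (assumption)."""
    n = int(sample_rate * frame_ms / 1000)
    return [signal[k:k + n] for k in range(0, len(signal) - n + 1, n)]

def dct_first_coeffs(frame, n_coeffs=128):
    """First n_coeffs coefficients of an unnormalized DCT-II (assumed variant)."""
    n = len(frame)
    return [sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(frame))
            for k in range(n_coeffs)]

sample_rate = 16000  # assumption: not stated in the text
# 100 ms synthetic 440 Hz tone standing in for a speech signal.
signal = [math.sin(2 * math.pi * 440 * t / sample_rate)
          for t in range(sample_rate // 10)]
frames = split_frames(signal, sample_rate)
features = [dct_first_coeffs(f) for f in frames]
```

With these assumptions each 20 ms frame holds 320 samples, so keeping 128 coefficients retains the lower part of the spectrum only.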

In order to avoid the loss of long-term information that may
help to distinguish spoofing attacks, when an input is
presented to the MLP, each central speech frame is surrounded
by its ten previous frames and its ten following frames,
including silence frames. Thus, a vector with 2688 features
(21 frames of 128 coefficients each) is used as network input.
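A minimal sketch of this context stacking is given below. The function name and the zero padding at the beginning and end of a file are assumptions; the text does not say how file boundaries are handled.

```python
def stack_context(frame_feats, is_speech, context=10):
    """Build one input vector per speech frame from the frame itself plus
    `context` frames on each side (silence frames included as context).
    Positions outside the file are zero-padded (assumption)."""
    dim = len(frame_feats[0])
    zero = [0.0] * dim
    stacked = []
    for centre, speech in enumerate(is_speech):
        if not speech:
            continue  # only speech frames act as central frames
        vec = []
        for k in range(centre - context, centre + context + 1):
            vec.extend(frame_feats[k] if 0 <= k < len(frame_feats) else zero)
        stacked.append(vec)
    return stacked

# 30 dummy frames of 128 coefficients; every third frame flagged as silence.
feats = [[float(i)] * 128 for i in range(30)]
flags = [i % 3 != 0 for i in range(30)]
inputs = stack_context(feats, flags)
```

Each resulting vector has 21 x 128 = 2688 entries, matching the network input size stated above.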

The backpropagation algorithm, in conjunction with the
Stochastic Gradient Descent optimization technique, was
applied to train the network to classify whether the input
represents an authentic (human) or a spoofed audio frame. The
ground truth consists of a label indicating whether the input
audio is authentic or belongs to one of five spoofing categories,
named S1, S2, S3, S4 and S5.

Figure 1: Basic flowchart used for spoofing detection

Preliminary experiments indicated that using only two
output classes, spoofing and human, led to poor performance
on class S1. One hypothesis is that S1 differs from the other
attacks because it is based on a unit selection algorithm, which
concatenates pieces of authentic signal to create new audio. To
deal with this, the network training was steered towards
distinguishing S1 from the other spoofing attacks, increasing
the relevance (for network performance) of detecting the borders
between pieces of authentic speech. Thus, three classes were
created, as depicted in Table 1: authentic human speech (100),
S1 spoofing attack (010) and other spoofing attacks (001).
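The three-class encoding can be written down as a simple lookup. In this sketch, the dictionary name, the helper function and the idea that every frame inherits the label of its file are assumptions made for illustration.

```python
# One-hot targets for the three classes: human, S1, and all other known attacks.
TARGETS = {
    "human": (1, 0, 0),
    "S1": (0, 1, 0),
    "S2": (0, 0, 1),
    "S3": (0, 0, 1),
    "S4": (0, 0, 1),
    "S5": (0, 0, 1),
}

def frame_targets(file_label, n_frames):
    """Assumption: every frame of a file inherits the file-level label."""
    return [TARGETS[file_label]] * n_frames
```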

Figure 2 shows the deep MLP architecture used in this
paper. 1024 neurons were used in the first hidden layer, 512 in
the second hidden layer and 32 in the last one. The last hidden
layer is deliberately small in order to create a bottleneck, which
compresses the signal information useful for spoofing
classification into a low-dimensional representation. Each hidden
layer uses the logistic function as activation. The output layer
consists of 3 neurons with softmax activation, each returning a
real number between 0 and 1. After network training was
finished, the output layer was removed and the activations of the
last hidden layer neurons were used as the new output, yielding
the bottleneck features, as indicated in Figure 2.
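The bottleneck extraction described above can be sketched as a forward pass that stops at the last hidden layer. The code below uses randomly initialized stand-in weights and a scaled-down toy network so that it runs quickly; the helper names and the weight initialization are assumptions, and only the layer sizes in PAPER_SIZES come from the text.

```python
import math
import random

# Layer sizes reported in the text: input, three hidden layers, softmax output.
PAPER_SIZES = [2688, 1024, 512, 32, 3]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_layers(sizes, seed=0):
    """Untrained stand-in weights; a real system would load trained ones."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def bottleneck_features(x, layers):
    """Forward pass through the hidden layers only; the output layer is skipped."""
    y = x
    for weights, biases in layers[:-1]:
        y = [logistic(b + sum(w * v for w, v in zip(row, y)))
             for row, b in zip(weights, biases)]
    return y

# Scaled-down toy network with the same shape pattern as PAPER_SIZES.
toy_sizes = [21, 16, 8, 4, 3]
toy_layers = random_layers(toy_sizes)
features = bottleneck_features([0.5] * toy_sizes[0], toy_layers)
```

With the sizes in PAPER_SIZES the same function would return a 32-dimensional vector per input frame.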

Table 1: Meaning of the MLP output classes.

y0   y1   y2   Meaning
1    0    0    human
0    1    0    S1 attack
0    0    1    S2, S3, S4, S5 attacks



Figure 2: MLP used for feature extraction and classification 


3.2. Classification 

Three different classifiers were tested: Support Vector
Machines (SVM), Gaussian Mixture Models (GMM) and Multilayer
Perceptron. In the cases of the SVM and the GMM classifiers,
feature extraction took an additional step. Since each
audio file has a different duration and, thus, a different number
of frames, the feature vectors of all frames were averaged so that
each file was represented by a single fixed-size 32-dimensional
feature vector.
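The per-file averaging amounts to a column-wise mean over the frame-level bottleneck vectors. A minimal sketch, with a hypothetical function name:

```python
def average_frames(frame_vectors):
    """Column-wise mean of frame-level feature vectors: one vector per file."""
    n = len(frame_vectors)
    return [sum(column) / n for column in zip(*frame_vectors)]

# Three dummy frames with 32 bottleneck features each.
file_vector = average_frames([[0.0] * 32, [0.5] * 32, [1.0] * 32])
```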

An SVM classifier based on the Radial Basis Function
(RBF) kernel was built. Feature vectors of the training set
samples were computed and used to train the SVM-RBF. All
spoofing attacks were treated as a single negative class for
training.

The SVM-RBF classifier parameters C (which controls the cost
of misclassification on the training data) and γ (the parameter
of the Gaussian kernel used to handle nonlinear classification)
were tuned by grid search with K-fold cross-validation over the
training set, using 5 folds. Values of 0.001, 0.01, 0.1, 1.0, 10.0,
100.0, 1000.0 and 10000.0 were searched for both C and γ.
The optimum parameters were chosen so as to minimize the
average equal error rate (EER) over all 5 folds. This search
yielded optimum values of C = 0.1 and γ = 10, and the
SVM-RBF classifier was then retrained on the whole training set.
SVM-RBF outputs vary in the interval [0.0, 1.0] and represent
the likelihood that the test sample belongs to the positive class,
i.e., authentic speech audio.
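The search procedure can be sketched independently of any SVM library. In the code below, fold_eer is a hypothetical callback that would train an SVM-RBF on the training indices and return the EER on the validation indices; the contiguous, unshuffled folds are also an assumption, since the text does not describe how the folds were built.

```python
from itertools import product

GRID = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (no shuffling)."""
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def grid_search(n_samples, fold_eer, k=5):
    """Return (mean EER, C, gamma) with the lowest mean EER over the k folds.
    fold_eer(C, gamma, train_idx, val_idx) is a hypothetical user callback."""
    best = None
    for C, gamma in product(GRID, GRID):
        folds = list(k_fold_indices(n_samples, k))
        mean_eer = sum(fold_eer(C, gamma, tr, va) for tr, va in folds) / k
        if best is None or mean_eer < best[0]:
            best = (mean_eer, C, gamma)
    return best
```

The grid has 8 x 8 = 64 parameter pairs, so with 5 folds the callback is invoked 320 times.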


For the GMM-based classifier, two GMMs were trained,
one with authentic audio and another with spoofed audio. The
following numbers of Gaussian mixtures were tested: 4, 8, 32,
64, 128, 256 and 512, of which 8 mixtures gave the lowest EER
on the development set. The classifier output is given by the
log-likelihood ratio of the authentic GMM with respect to the
spoofing GMM.
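The scoring rule can be sketched as follows. Diagonal covariance matrices and the (weights, means, variances) tuple layout are assumptions made for this example; the text does not state the covariance type, and the toy models below have a single component each rather than 8.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM (assumed form),
    combined over components with the log-sum-exp trick."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                               for xi, m, v in zip(x, mu, var))
        log_terms.append(math.log(w) + log_gauss)
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def llr_score(x, authentic_gmm, spoof_gmm):
    """Log-likelihood ratio: positive values favour the authentic model."""
    return gmm_log_likelihood(x, *authentic_gmm) - gmm_log_likelihood(x, *spoof_gmm)

# Toy single-component models in 2-D, for illustration only.
authentic = ([1.0], [[1.0, 1.0]], [[1.0, 1.0]])
spoof = ([1.0], [[-1.0, -1.0]], [[1.0, 1.0]])
score = llr_score([1.0, 1.0], authentic, spoof)
```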

Figure 3 shows the log-likelihood ratio (score) distribution
obtained on the development set when an 8-mixture GMM was
employed to classify the bottleneck features. Score values vary
in the interval (-∞, +∞), and the higher the value, the higher
the probability that the tested sample is authentic. The figure
clearly shows that this strategy provided a good separation on
the development set. A similar behavior was observed for the
SVM-RBF classifier.





Figure 3: Score distributions for spoofed (green) and authentic
(blue) audio on the development set when using a GMM with
8 Gaussians and bottleneck features

The third and last tested approach consisted of using the
MLP trained for feature extraction directly as a classifier,
without removing the output layer. In this case, the feature
extraction was merged with the classification step.

As the last layer of the network returns three values through the
softmax function, as presented in Figure 2, only y0 is
considered, since it represents the likelihood of the input being
authentic speech. Thus, values for this third approach vary in the
interval [0.0, 1.0]. A score (y0) was calculated for each
frame in the audio file, generating a score array for the entire
audio. This array was used to compute a single score for the
audio sample. To do so, and in order to remove outliers within
the audio file, the lowest 15% of the array values and the highest
25% were discarded. The remaining 60% of the scores
were then averaged, resulting in the final score.
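This trimmed averaging can be sketched as below. The function name and the rounding of the cut points (integer floor) are assumptions; the text only gives the percentages.

```python
def trimmed_score(frame_scores, low_pct=15, high_pct=25):
    """Average of the frame scores left after discarding the lowest low_pct
    percent and the highest high_pct percent (cut points floored, an assumption)."""
    if not frame_scores:
        raise ValueError("at least one frame score is required")
    ordered = sorted(frame_scores)
    n = len(ordered)
    lo = n * low_pct // 100        # number of lowest scores to drop
    hi = n - n * high_pct // 100   # index just past the last kept score
    kept = ordered[lo:hi]
    return sum(kept) / len(kept)

# 20 dummy frame scores: the 3 lowest and 5 highest are discarded.
final_score = trimmed_score([i / 20 for i in range(20)])
```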

These three approaches were then applied to the evaluation
set, which contained samples comprising both known and
unknown attacks. Results are presented in the next section.

4. Results and Analysis

Results obtained for the three tested systems are summarized
in Table 2. According to the challenge rules, the adopted metric
is the EER. For more details on what that means and how it is
calculated, please refer to the contest evaluation plan [3].
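For orientation, a simple way to approximate the EER from two sets of scores is sketched below. This is not the official scoring procedure of the evaluation plan; it is a threshold sweep over the observed scores, assuming that higher scores indicate authentic speech.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping a threshold over all observed scores and
    taking the point where false acceptance and false rejection are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: authentic samples scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one authentic sample and one spoofed sample overlap.
eer = equal_error_rate([0.2, 0.6, 0.8, 0.9], [0.1, 0.3, 0.4, 0.7])
```

The value returned is a fraction; multiplying by 100 gives the percentage form used in the results table.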

Table 2: EER results (%) obtained on the development set and on
the evaluation set for known and unknown attacks.

Classifier   Dev Set   Known   Unknown   All
SVM          0.491     0.412   13.026    6.719
GMM          0.658     0.443   12.796    6.620
MLP          0.631     0.464   12.589    6.527

It can be seen that:

• the SVM-RBF classifier showed the best performance
for known attacks, while the unknown attacks were better
detected by the MLP classifier. However, the EER values
are very close, which means that the choice of classifier
matters less for the overall performance than
the feature extraction mechanism itself.

• all three systems performed very well for the known
attacks, which shows that the network was successful in
capturing the patterns of the attacks learned during training.

• most of the unknown attacks were correctly detected;
however, a clear degradation of performance can be
observed when the error rates of known and unknown attacks
are contrasted.

• when the five unknown attacks are considered separately,
by attack method (these results are not shown here
for reasons of space), the proposed method obtained
good results (EER close to 1%) in three of them.
Results for attacks S8 (a tensor-based voice conversion) and
S10 (a speech synthesis algorithm implemented using
the open source MaryTTS system), however, indicate
poor performance, with EERs of 26.8% and 31.7%,
respectively.

One hypothesis for the degradation observed in the classifiers'
performance on the evaluation set is overfitting to noise
present in the training samples. Overfitting of this kind would
show up as a significant difference in error rates even
when training and testing samples are drawn from the same
distribution. That is not what the results presented here show,
since performance on the development set is close to the
performance for known attacks on the evaluation set.

The second hypothesis is a lack of generalization capacity,
which means that some of the distinctive features learned by the
network and the classifiers are not related to what distinguishes
an authentic recording from spoofing attacks in general, but are
rather due to patterns observed only in the known attack
samples, i.e., specific characteristics of the synthesis and
conversion algorithms used during the training step.

It was also verified after the submission that many spoofed
audio files in the training and development sets present a
discontinuity in low-frequency noise, mainly in the range 0
to 100 Hz. Figure 4 shows the problem. In this case, as 128
DCT coefficients were used as DNN input, the first coefficients
will reflect this discontinuity and the network will learn this
characteristic as a relevant feature to distinguish authentic from
spoofed audio, degrading the network's generalization capacity
when audio without this discontinuity is presented.

Even though some degradation of performance is expected,
the results obtained show that there is room for improvement,
since the nature of the unknown attacks is not inherently
different from that of the known ones.

5. Conclusions 

The study presented here comprises the results obtained, along
with a description of the systems implemented by CPqD for
the Automatic Speaker Verification Spoofing and
Countermeasures Challenge (ASVSpoof2015), held as a special
session at INTERSPEECH 2015. The main goal of the challenge
was the detection of spoofing attacks based on synthesized and
transformed speech.

Figure 4: Low-frequency noise discontinuity present in the
training and development sets (0 to 1000 Hz on the vertical axis)

A speech feature extraction framework based on deep
neural networks for spoofing detection is presented. The network
can be used as a classifier itself or can be viewed as a
bottleneck feature extractor feeding other classifiers. Two
different classifiers were tested: a Gaussian Mixture Model and a
Support Vector Machine with the radial basis function.

The proposed systems were trained with the training set and 
tested on two different evaluation sets: one with attacks similar 
to those presented during training and another with unknown 
attacks, just as described in the evaluation plan. 

The use of a DNN as a feature extractor is of particular 
interest, as the generated features are fine-tuned to provide a 
good representation specifically for the problem to be solved, be 
it spoofing detection, speaker/speech recognition or other tasks. 
However, these features are highly dependent on the training 
samples and they can learn any bias present in this set. Thus 
the careful design of large and diverse datasets is even more 
relevant when using this kind of feature. 

Performance for the known attacks was satisfactory
(EER < 0.5%), indicating the adequacy of the proposed
strategies. Results obtained for the unknown attacks were also
promising. For some of the new attacks, however, the detection
strategy performed poorly. This could be overcome
with training data composed of samples generated by more
diverse attack techniques. In addition to an improved training
set, the use of alternative parametrizations of the input
audio in the neural network could be beneficial. Representations
that make the speech phase spectrum more evident are especially
interesting, as the use of such information has proved to be
highly successful in the literature on spoofing detection.

Lastly, in future work, other network architectures, such as
Convolutional Neural Networks, should be tested in order to
study which of them provides better detection of unknown
attacks, an ability that is extremely relevant in real-world
applications, as the techniques used by fraudsters for identity
theft are rarely known in advance.